AUTOMATED GENERATION OF TEXT ANALYSIS SYSTEMS



BACKGROUND OF THE INVENTION 

Text analysis is an area of computer science that focuses on processing text to extract
information through pattern recognition. The decade of the 1990s saw an unprecedented
explosion in work on learning methods for text analysis. Prior text analysis methods rely on
unsupervised learning, where the system is responsible for teasing generalizations from texts or
samples. One such system is the HASTEN system described in "SRA: Description of the SRA
System as Used for MUC-6," Krupka, George R., pp. 221-235, Proceedings Sixth Message
Understanding Conference (MUC-6), November 1995 (referred to herein as Krupka). Krupka
teaches a system for grouping text samples supplied and labeled by users and creating data structures
called e-graphs. The system in Krupka then uses a similarity metric to decide if portions of an input
text are related to e-graphs that have been created. It applies these collections of e-graphs, called
collectors, as sequential processing phases, in order to match each sample set to the input text.
Generalization of the elements of e-graphs is performed manually by the developer. There is no
notion of generating grammar rules from e-graphs. The work does not establish a method for
converting the collectors to rule-based passes of a text analyzer, nor does it describe a way
to automatically generate substantial portions of a text analyzer. The system in Krupka requires a
large amount of user interaction to perform tasks manually beyond adding and labeling samples, and
was applied specifically to create an event-level pattern for MUC text analysis. Krupka's
system therefore does not teach a general and fully automated text analyzer capability.

Another text analysis system is disclosed in Huffman (U.S. Patents 5,796,926 and
5,841,895). The Huffman patents deal with text extraction at the event level and teach methods for
locating potential event patterns of interest. In essence, Huffman teaches a rigid, inflexible method
of searching for specific patterns such as "actor acts on object."

There is a need for a system that automatically generates text analysis systems from minimal
training samples, that retains sufficient intelligence to recognize patterns beyond those described
by the training samples, and that is sufficiently flexible to adapt to a variety of applications.


SUMMARY OF THE INVENTION 

An embodiment of the present invention includes a generator program 106 that utilizes a
hierarchy of user-supplied samples and a text analyzer framework to create complete text analyzer
programs. The hierarchy and framework are related in that the top-level concepts of the hierarchy
are associated with stubs, or empty regions of passes, in the text analyzer framework. The generator
program fills these stub regions with text analyzer passes generated from samples in the hierarchy.
A user guides the conversion of the samples to generalized rules for recognizing not only the given
samples, but also related patterns that are processed at a later time. Users may supply additional
samples in order to process novel patterns that were not anticipated when the initial text analyzer was
created. When a text analysis system according to the present invention fails to identify a pattern,
a user can simply highlight the unrecognized sample in text and label its components, if necessary,
to enable the generator to create a new text analyzer that recognizes the new sample and related
samples processed at a later time. Rather than using a similarity metric, an embodiment of the
present invention applies rules that have been automatically generated from samples.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustration in block diagram form of various components of an embodiment of the
present invention;

FIG. 2 is a flow chart illustrating the steps executed by a text analyzer program produced by the
present invention;

FIG. 3 is an illustration of a parse tree data structure created and maintained by the present invention;

FIG. 4 shows a sample hierarchy constructed and used by the present invention;

FIG. 5 is a flow diagram illustrating the major steps of rule generation and rule merging according
to methods of the present invention;

FIG. 6 is an illustration showing the addition of a pass to the sequence of steps executed by the text
analyzer produced by the present invention;

FIG. 7 is an illustration of a parse tree data structure modified by a partial analysis step performed
by the present invention;

FIG. 8 is an illustration of an updated sample hierarchy;

FIG. 9 shows the addition of a second pass to the sequence of steps executed by the analyzer;

FIG. 10 is an illustration of a parse tree data structure modified by a second partial analysis step
performed by the present invention;

FIG. 11A illustrates the relationship of various types of rules in the generator program;

FIG. 11B illustrates the logical sequence of steps for generalizing and merging rules;

FIG. 12 illustrates a user interface that allows a user to operate the generator program;

FIG. 13 illustrates the association of a text sample with the sample hierarchy;

FIG. 14 illustrates how a user labels the components of a sample via the user interface;

FIG. 15 illustrates a form tool used in connection with the generator program;

FIG. 16 illustrates the properties window used in connection with the user interface;

FIG. 17 illustrates the attributes window used in connection with the user interface;

FIG. 18 illustrates a menu incorporated into the sample manager for managing samples and
integrating them with the text analyzer development environment; and

FIG. 19 illustrates a sequence of steps placed into a stub region by the generator, along with the
rules generated for one of the steps.

DETAILED DESCRIPTION 

Directing attention to the drawings, FIG. 1 is a high-level block diagram of the hardware
typically used in an embodiment of the present invention. Computer 100 may have a conventional
design, incorporating a processor 102 utilizing a central processing unit (CPU) and supporting
integrated circuitry. Memory 104 may include RAM and NVRAM such as flash memory, to
facilitate storage of computer programs executed by processor 102, such as generator program 106.
Also included in computer 100 are keyboard 108, pointing device 110, and monitor 112, which allow
a user to interact with program 106 during execution. Mass storage devices such as disk drive 114
and CD ROM 116 may also be incorporated in the computer 100 to provide storage for generator
program 106 and associated files. Computer 100 may communicate with other computers via
modem 118 and telephone line 120 to allow generator program 106 to be operated remotely, or to
utilize files stored at different locations. Other media may also be used in place of modem 118 and
telephone line 120, such as a direct connection or high speed data line. The components described
above may be operatively connected by a communications bus 122.

Generator program 106 produces text analyzer programs by generating rules from samples
supplied by users to create individual passes of a multi-pass text analyzer. A sample is a piece of
text that users have decided is a unit of interest, such as a name or idiomatic phrase. A sample
hierarchy is an index for storing all user-added samples. A rule is a representation for a pattern of
interest, which may include associated actions to ensure that the pattern has matched correctly and
to record the match in the parse tree. A rule typically associates a concept with a pattern or phrase.
When the pattern matches a list of nodes, the matched nodes of the parse tree are condensed or
reduced to a node associated with the concept.

As used herein, a pass is one step of a multi-step analyzer, in which the generator program
106 traverses a parse tree to execute a set of rules associated with the pass. As used herein, a parse
tree is a tree data structure constructed and maintained by the generator program 106 to organize text
and all the patterns that have been recognized within the text. Successive passes are created in a
cascading fashion by performing partial text analyses employing existing passes. The resulting text
analyzer program interleaves the generated passes with a framework of existing passes. The
complete text analysis system can then process text to identify patterns similar to samples added by
users. Generation of rules from samples encompasses a wide range of constructs and granularities
that occur in text, from individual words to intrasentential patterns (such as a grammar), to sentential,
paragraph, section, and other formats that occur in text documents.

To exemplify the methods and data structures of the present invention, we use simple
telephone number patterns such as

497-5318
(949) 497-5318
Home: (949) 497-5318 (1)

FIG. 2 shows a resulting text analyzer program produced by an embodiment of the present
invention. Text analyzer program 200 contains three passes. The first pass, tokenize (202),
processes an input text to group the characters according to alphabetic, numeric, white space, and
punctuation units, referred to herein as tokens. The tokens are all placed into a parse tree data
structure 300 (FIG. 3). The parse tree 300 is used and modified by subsequent passes. The phrases
pass (204) is a stub, or empty placeholder, for the passes that the generator program 106 creates
using user-supplied samples in a sample hierarchy. Since there are no passes there initially, this
placeholder pass has no effect on the parse tree 300. Finally, the output pass (206) displays a
representation of the parse tree 300.

Given the sample input text:

Home: (949) 497-5318 (2)

the output pass displays the parse tree 300 as illustrated in FIG. 3.
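By way of illustration only, the tokenize pass described above may be sketched in Python as follows. The names `Token`, `TOKEN_RE`, and `tokenize` are assumptions introduced for this sketch, as is the choice to treat only ASCII letters as alphabetic and every other non-space, non-digit character as a one-character punctuation token.

```python
import re
from dataclasses import dataclass

# One alternative per token class named in the text. Anything that is not
# an ASCII letter, a digit, or whitespace becomes a single punctuation token
# (a simplifying assumption of this sketch).
TOKEN_RE = re.compile(
    r"(?P<ALPHABETIC>[A-Za-z]+)"
    r"|(?P<NUMBER>[0-9]+)"
    r"|(?P<WHITESPACE>\s+)"
    r"|(?P<PUNCTUATION>.)",
    re.DOTALL,
)


@dataclass
class Token:
    kind: str   # ALPHABETIC, NUMBER, WHITESPACE, or PUNCTUATION
    text: str
    start: int  # offset of the first character
    end: int    # offset of the last character (inclusive)


def tokenize(text):
    """Group the characters of text into tokens, recording their offsets."""
    return [Token(m.lastgroup, m.group(), m.start(), m.end() - 1)
            for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for tok in tokenize("Home: (949) 497-5318"):
        print(tok)
```

Each token keeps its start and end offsets because later steps locate a sample in the parse tree by its offsets within the text.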

Text analyzer program 200 has no knowledge of telephone number patterns. If a user wants
phone numbers to be grouped under a concept called phone, a sample hierarchy as shown in FIG.
4 can be constructed. This hierarchy of samples 400 has a top-level stub concept called phrases 402,
which matches the stub region 204 to be filled within the text analyzer 200. The user creates a rule
concept called phone 404, in order to place telephone number samples under it. Under the rule
concept 404, the user places telephone number samples such as "497-5318" and labels their
components with the arbitrary names prefix 406, corresponding to "497," and suffix 408,
corresponding to "5318," which are referred to herein as label concepts 410. Such samples can be
added in a simple fashion with a user interface that allows highlighting the complete text and its
components.
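One possible in-memory form of the sample hierarchy of FIG. 4 is sketched below. The classes `Concept` and `Sample`, the `kind` strings, and the representation of labels as inclusive offset pairs are all assumptions made for illustration; the description does not prescribe a particular data layout.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    text: str
    # label name -> (start, end) inclusive offsets within text (assumed layout)
    labels: dict = field(default_factory=dict)


@dataclass
class Concept:
    name: str
    kind: str  # "stub", "rule", or "label" (hypothetical tags)
    children: list = field(default_factory=list)
    samples: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child


def rule_concepts(concept):
    """Yield the rule concepts of a hierarchy in depth-first order."""
    if concept.kind == "rule":
        yield concept
    for child in concept.children:
        yield from rule_concepts(child)


# The hierarchy of FIG. 4: stub "phrases", rule concept "phone",
# label concepts "prefix" and "suffix", and one labeled sample.
phrases = Concept("phrases", "stub")
phone = phrases.add(Concept("phone", "rule"))
phone.add(Concept("prefix", "label"))
phone.add(Concept("suffix", "label"))
phone.samples.append(Sample("497-5318", {"prefix": (0, 2), "suffix": (4, 7)}))
```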

Generator program 106 can be invoked to generate a new analyzer by executing the sequence
of steps 500 illustrated in FIG. 5. Generator program 106 first traverses the sample hierarchy
400 to find the rule concept called phone 404 (step 502). If a rule concept is found (decision step
504), it traverses the samples corresponding to the rule concept found, encountering the phone
number 497-5318 (step 506). The generator program 106 executes a partial analysis (step 508) of
the text containing the current sample. Because a pass will be generated for the current rule concept,
the partial analysis stops just before the pass to be created. (See decision step 510.) In our
simplified example, the partial analysis consists only of the tokenize pass 202, so that is what the
generator program 106 executes. Partial analysis involves applying the partially built text analyzer,
containing passes constructed so far, to the text containing the samples used to build the current pass.
Partial analysis is conducted in an iterative fashion. Continuing to step 514, the generator program
106 locates the position of the current sample within the parse tree that has been constructed so far
(FIG. 3), based on the offsets of the sample within its entire text (step 512). Going back to the
example, "(949) 497-5318" has a start and end offset in the text that it appears in, for example 0 to
13, if it is at the beginning of a text file. Looking in the parse tree later in the parse, the phrase that
covers this range of offsets now appears as "(949) _phone." Each part or node of the parse tree has a
start and end offset. The "_phone" portion accounts for characters 6 through 13. The generator
program 106, in building a complete phone rule, uses whatever phrase it finds in the parse tree 300
at the range of offsets from 0 to 13. In the parse tree 300, the generator program 106 finds the tokens

497 \- 5318 (3)

It therefore generates (step 516) the raw rule based precisely on what is represented in the parse tree,
as follows:

_phone <- 497 \- 5318 @@ (4)

The underscore before phone indicates that this is a non-literal concept. The <- arrow indicates a
rewrite of the phrase to the right with the concept to the left. The @@ marker denotes the end of
the rule. The backslash preceding the dash means that this dash is to be taken literally, rather than
being part of the rule language. At this point, the generator program 106 can attach labeling
information to the first element ("497") and the last element ("5318") of the phrase, as prefix and
suffix, as follows:

_phone <- 497 [label=_prefix] \- 5318 [label=_suffix] @@ (5)
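The derivation of rules (4) and (5) from the offsets of a sample may be sketched as follows. This is a minimal illustration under stated assumptions: the parse tree is reduced to a flat list of (text, start, end) nodes, and the helper names `tokenize`, `escape`, and `raw_rule` are hypothetical rather than part of the described generator.

```python
import re


def tokenize(text):
    """Split text into (token, start, end) triples with inclusive end offsets."""
    return [(m.group(), m.start(), m.end() - 1)
            for m in re.finditer(r"[A-Za-z]+|[0-9]+|\s+|.", text, re.DOTALL)]


def escape(token):
    """Backslash-escape any token that is not purely alphanumeric."""
    return token if token.isalnum() else "\\" + token


def raw_rule(concept, nodes, start, end, labels=None):
    """Build a raw rule from the nodes lying inside offsets start..end.

    labels maps a node's (start, end) offsets to a label name.
    """
    labels = labels or {}
    elements = []
    for token, tok_start, tok_end in nodes:
        if tok_start < start or tok_end > end:
            continue  # node lies outside the sample's offset range
        element = escape(token)
        label = labels.get((tok_start, tok_end))
        if label:
            element += f" [label=_{label}]"
        elements.append(element)
    return f"_{concept} <- {' '.join(elements)} @@"


if __name__ == "__main__":
    nodes = tokenize("Home: (949) 497-5318")
    # The sample "497-5318" occupies offsets 12 through 19 of this text.
    print(raw_rule("phone", nodes, 12, 19))
    print(raw_rule("phone", nodes, 12, 19,
                   {(12, 14): "prefix", (16, 19): "suffix"}))
```

Because the rule is read off whatever nodes cover the offset range, the same function would pick up a `_phone` node instead of raw tokens once an earlier pass has reduced them.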

Since there are no other samples (decision step 518) under the phone concept, the generator
program 106 has no opportunity to merge and compare samples. Having finished with the samples
under this rule concept, the generator program 106 at step 526 creates a new pass called phone for
the rule set it has generated (consisting of one rule in this case). The generator program 106 then
adds the new pass to the analyzer sequence (step 528), as shown in FIG. 6. The generator program
106 then looks for the next rule concept at step 530. Had there been additional rule concepts with
samples in the sample hierarchy (step 520), control would return to step 508, where the generator
program 106 would have proceeded to analyze the overall text from which those samples derived.
It would perform a partial analysis up to and including the pass called phone. For the given text, the
resulting parse tree data structure 700 is shown in FIG. 7. Note that the tokens "497", "-", and
"5318" have been replaced in the parse tree by the single node "_phone". Now let us suppose that
the user adds a second sample to the hierarchy, under a new concept, yielding the sample hierarchy
shown in FIG. 8.

Had the phone concept not been in this sample hierarchy, the generator program 106 would
have built the rule

_completePhone <- \( 949 [label=_areaCode] \) \  497 \- 5318 @@ (6)

But because the phone sample is also present and the generator program 106 has installed the phone
pass within the analyzer, the generator program 106 is given parse tree 700 (FIG. 7) when
constructing the rule for completePhone. Therefore, the generator program 106 builds the following
rule:

_completePhone <- \( 949 [label=_areaCode] \) \  _phone @@ (7)

The product of the prior automatically generated pass is used in building the rules for the current
pass called completePhone. The generator program 106 has now built an analyzer for phone
numbers that follows the passes illustrated in FIG. 9. In this example, the phone pass and
completePhone pass each contain one rule. This analyzer with two automatically generated passes
produces the parse tree in FIG. 10 from the sample input text in (2).

Generator program 106 automatically creates the passes of a text analyzer in stepwise
fashion, each time using the sequence of passes constructed so far in order to create the next pass of
the analyzer. It adds each new pass to a backbone of manually built and previously generated passes.
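The cascading effect of the two generated passes can be illustrated with the following sketch. It is a simplification under stated assumptions: the parse tree is flattened to a list of node names, the matched nodes are discarded rather than kept as children, and the functions `apply_pass` and `analyze` are hypothetical names.

```python
def apply_pass(nodes, rules):
    """Scan left to right, reducing each match of a rule's pattern to its concept."""
    out, i = [], 0
    while i < len(nodes):
        for concept, pattern in rules:
            if nodes[i:i + len(pattern)] == pattern:
                out.append(concept)       # condense matched nodes to one node
                i += len(pattern)
                break
        else:
            out.append(nodes[i])          # no rule matched at this position
            i += 1
    return out


def analyze(nodes, passes):
    """Run the passes in order; each pass sees the output of the previous one."""
    for rules in passes:
        nodes = apply_pass(nodes, rules)
    return nodes


# Rules (4) and (7), written as (concept, pattern) pairs without labels.
phone_pass = [("_phone", ["497", "-", "5318"])]
complete_phone_pass = [("_completePhone", ["(", "949", ")", " ", "_phone"])]

tokens = ["Home", ":", " ", "(", "949", ")", " ", "497", "-", "5318"]

if __name__ == "__main__":
    print(analyze(tokens, [phone_pass]))
    print(analyze(tokens, [phone_pass, complete_phone_pass]))
```

Running only the first pass corresponds to the partial analysis that the generator performs before building the completePhone rule, which is why that rule refers to `_phone` rather than to the raw tokens.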

The discussion above describes the generation of one pass per rule concept. Additional 
modes, specified by the user who constructs the sample hierarchy, enable the rules generated for 
multiple rule concepts to be merged into a single large pass (step 524), in order to both optimize 
performance and to enable more sophisticated rule generation that identifies and unifies ambiguous 
constructs. For example, if "New York" is listed under a rule concept city and a rule concept state, 
then a unified treatment of these rule concepts can enable the generation of rules such as: 

_city [label=_state] <- New York @@ (8) 

which condenses instances of "New York" to both a city concept and a state concept in a parse tree. 

Optimizations 

Executing the generator program 106 can be computationally expensive, because each sample 
in the sample hierarchy requires the text containing it to be partially analyzed, in order to generate 
the rule corresponding to the sample. Generator program 106 can be modified to keep track of 
instances where multiple samples under a rule concept derive from the same text. In those cases, the 
given text need be partially analyzed only once, in order to glean the RAW rules for all the samples 
that derived from that text. 
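The bookkeeping described above may be sketched as follows, assuming each sample records an identifier for the text it came from. The function name `raw_rules_for_concept` and the callables `partial_analyze` and `make_rule` are placeholders for whatever the generator actually uses.

```python
from collections import defaultdict


def raw_rules_for_concept(samples, partial_analyze, make_rule):
    """Generate raw rules while analyzing each source text only once.

    samples: iterable of (text_id, start, end) triples.
    partial_analyze: callable text_id -> parse tree (the expensive step).
    make_rule: callable (tree, start, end) -> raw rule.
    """
    spans_by_text = defaultdict(list)
    for text_id, start, end in samples:
        spans_by_text[text_id].append((start, end))

    rules = []
    for text_id, spans in spans_by_text.items():
        tree = partial_analyze(text_id)  # run once per text, not once per sample
        rules.extend(make_rule(tree, start, end) for start, end in spans)
    return rules
```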

In a preferred embodiment, further optimization may be achieved when generator program
106 places user-added samples into a single sample file. Thus, each rule file has an associated
sample file. The sample file may be stored in memory 104, disk drive 114 or CD ROM 116. In this
way the number of partial text analyses is reduced for a sample hierarchy with many samples.
A further optimization is to regenerate only those passes whose complement of samples has
changed. While there is a danger that some subsequent pass may not be updated correctly due to
dependencies on the current pass, most of the time this method of generation (generate dirty) is
adequate for rapid development and testing. Occasionally, a generate all function
may be invoked to rebuild every single pass, thus making sure that all passes that need updating
get updated.
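The difference between the generate dirty and generate all functions may be sketched as below. The dictionary layout of a pass and the `dirty` flag are assumptions of this sketch; the description does not state how changed sample sets are tracked.

```python
def generate(passes, build, only_dirty=True):
    """Rebuild passes and return the names of those that were rebuilt.

    passes: list of dicts with keys "name", "samples", "rules", "dirty".
    build: callable samples -> rules.
    only_dirty=True mimics "generate dirty"; False mimics "generate all".
    """
    rebuilt = []
    for p in passes:
        if p["dirty"] or not only_dirty:
            p["rules"] = build(p["samples"])
            p["dirty"] = False
            rebuilt.append(p["name"])
    return rebuilt
```

Note that this sketch, like the generate dirty mode described above, does not follow dependencies: a later pass built on top of a rebuilt pass keeps its old rules until generate all is run.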

Rule Generalization and Merging

A preferred embodiment of the present invention also has the capability to generalize and
merge raw rules generated directly from samples as illustrated in FIG. 5 at step 524. One sample
is usually not sufficient to derive or generalize rules. At least two samples of any given pattern are
required in order to deduce the more general pattern. When multiple samples are available under
a rule concept, the rule generalization and merging method is invoked at step 524 to build a variety
of rule sets: literal, general, optional, split, and constrained. The hierarchy shown in FIG. 11A, and
the flow diagram in FIG. 11B, best describe the relationships among these rule sets.

At step 560, for each raw rule generated (one per sample), the generator program 106 creates
a general rule by iteratively generalizing each element of the raw rule. For example, "497" will be
generalized to NUMBER, "Home" will be generalized to ALPHABETIC, "-" to PUNCTUATION,
and " " to WHITESPACE. At step 562, generator program 106 merges general rules that have
identical elements and length. The general rule for "497-5318" will be identical to that for
"555-1212," namely

_phone <- _NUMBER _PUNCTUATION _NUMBER @@ (14)

Therefore the rules for the two samples are merged under this general rule. The general rule retains
a list of all the raw rules that gave rise to it. At step 564, generator program 106 traverses the general
rules to build the split rules. The split rules require that all raw rules have consistent labeling. So
a split rule may appear:

_phone <- _NUMBER [label=_prefix] _PUNCTUATION _NUMBER [label=_suffix] @@ (15)

At step 566, generator program 106 traverses the split rules to generate the constrained rules.
Constrained rules are rules whose raw rules all have consistent features, such as length and
capitalization. A constrained rule may appear:

_phone <-
    _NUMBER [label=_prefix length=3]
    _PUNCTUATION
    _NUMBER [label=_suffix length=4]
    @@ (16)

The above rule constrains the first number to have three digits and the second number to have four
digits. At step 566, generator program 106 creates a literal rule for every raw rule. The literal rule
is constructed by looking "inside" each element of the phrase as deeply as can be seen in the parse
tree. For example, if a raw rule appears

_phone <- _LIST (NUMBER 497) \- _LIST (NUMBER 5318) @@ (17)

the literal rule produced is

_phone <- 497 \- 5318 @@ (18)

At step 570, generator program 106 creates optional rules by comparing the composition of general
rules that differ by one element. If that element is not a labeled element, then the two general rules
can be merged, with the difference element marked as optional.
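The generalization, merging, constraining, and optional-element steps may be sketched together as follows. This is an illustration under stated assumptions: rules are tuples of element strings, labels are omitted, only the length feature of numbers is constrained, and the `[optional]` marker and all function names are hypothetical notation introduced here.

```python
from collections import defaultdict


def classify(token):
    """Generalize one literal token to its token class."""
    if token.isdigit():
        return "_NUMBER"
    if token.isalpha():
        return "_ALPHABETIC"
    if token.isspace():
        return "_WHITESPACE"
    return "_PUNCTUATION"


def generalize(raw):
    return tuple(classify(tok) for tok in raw)


def merge(raw_rules):
    """Group raw rules under general rules with identical elements and length."""
    merged = defaultdict(list)
    for raw in raw_rules:
        merged[generalize(raw)].append(raw)  # general rule keeps its raw rules
    return dict(merged)


def constrain(general, raws):
    """Add length=N to a number element when all raw rules agree on its length."""
    out = []
    for i, elem in enumerate(general):
        lengths = {len(raw[i]) for raw in raws}
        if elem == "_NUMBER" and len(lengths) == 1:
            out.append(f"{elem} [length={next(iter(lengths))}]")
        else:
            out.append(elem)
    return out


def merge_optional(short, long):
    """If long is short plus one extra element, mark that element optional."""
    if len(long) != len(short) + 1:
        return None
    for i in range(len(long)):
        if long[:i] + long[i + 1:] == short:
            return long[:i] + (long[i] + " [optional]",) + long[i + 1:]
    return None


if __name__ == "__main__":
    raws = [("497", "-", "5318"), ("555", "-", "1212")]
    for general, members in merge(raws).items():
        print(general, constrain(general, members))
```

With the two samples of the running example, both raw rules fall under the single general rule (14), and the constrained form carries the lengths 3 and 4 shown in rule (16).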

By embellishing a sample hierarchy with particular attributes, the manner in which rules are
generated is controlled. The need to collect large sample sets in order to calculate statistically
plausible generalizations is eliminated. Attributes may be specified to indicate what is to be
generalized, what is to be collected as a list, and what is to be retained literally. For example, one
attribute may instruct the generator program 106 to always generalize whitespace to a rule element
that allows an arbitrary number of space characters. Another attribute may designate a label concept
as "closed," meaning any samples within it are to be collected into a list of only those samples, with
no generalization. Other flags control the rule sets to be retained for the pass being generated. If the
"constrain" flag is set to "true," then the constrained list of rules is retained by the generator program
106. Retaining a rule set involves placing it into the final list of rules for the pass under construction.
An enhancement to the sample hierarchy is to enable the described attributes to control the way rules
are generated for an entire subtree. If some concept within that subtree changes an attribute's value,
then that new value controls its subtree, and so on recursively.
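The subtree behavior of attributes described above amounts to looking up the nearest enclosing concept that sets a value. A minimal sketch follows; the `Concept` class, its `attribute` method, and the default of false for an unset attribute are assumptions made for illustration.

```python
class Concept:
    def __init__(self, name, parent=None, **attrs):
        self.name = name
        self.parent = parent
        self.attrs = attrs  # attributes set explicitly on this concept

    def attribute(self, key, default=False):
        """Return the value set on this concept or on its nearest ancestor."""
        node = self
        while node is not None:
            if key in node.attrs:
                return node.attrs[key]
            node = node.parent
        return default


# "gram" turns the constrained rule set on for its whole subtree;
# "Contact" overrides it for itself and everything beneath it.
gram = Concept("gram", constrained=True)
phrase = Concept("Phrase", gram)
contact = Concept("Contact", phrase, constrained=False)
phone_number = Concept("phoneNumber", contact)
```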

A nonexhaustive set of attributes may be utilized to allow a user to control the rule sets to 
be retained in each pass of the analyzer, as below: 



Attribute        Values

GENERAL          true/false
SPLIT            true/false
CONSTRAINED      true/false
RAW              true/false
LITERAL          true/false          (19)



The above attributes cause the generator program 106 to retain or discard the corresponding rule sets.
For example, if a concept in the sample hierarchy has the CONSTRAINED attribute set to true, then all
the constrained rules generated in that subhierarchy will be retained as part of the final analyzer. An
attribute called CLOSED, also with true/false values, controls the way parts of samples are collected
into rules. For example, given the samples

497-5318
555-1212 (20)

if the CLOSED attribute is set to true, then the corresponding constrained phone rule appears

_phone <- _LIST (497 555) \- _LIST (5318 1212) @@ (21)

That is, each element of the pattern is a "closed set," which collects any values found in the set of
samples. If the CLOSED attribute is set to false, the constrained rule is

_phone <-
    _NUMBER [label=_prefix length=3]
    _PUNCTUATION
    _NUMBER [label=_suffix length=4]
    @@ (22)
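The closed form of rule (21) can be reproduced by collecting, position by position, the values seen across the tokenized samples. The sketch below assumes every sample has the same number of elements and that a position holding a single punctuation value is written as an escaped literal, as in rule (21); the function name `closed_rule` is hypothetical.

```python
def closed_rule(concept, samples):
    """Build a rule whose elements list exactly the values seen in the samples.

    samples: tokenized samples of equal length, e.g. ("497", "-", "5318").
    """
    elements = []
    for column in zip(*samples):              # one column per rule element
        values = list(dict.fromkeys(column))  # de-duplicate, keep first-seen order
        if len(values) == 1 and not values[0].isalnum():
            elements.append("\\" + values[0])  # single punctuation stays literal
        else:
            elements.append("_LIST (" + " ".join(values) + ")")
    return f"_{concept} <- {' '.join(elements)} @@"


if __name__ == "__main__":
    print(closed_rule("phone", [("497", "-", "5318"), ("555", "-", "1212")]))
```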
Because white space and punctuation are often secondary in importance, a WHITE attribute with
values true/false can specify that whitespace in samples generalizes to the rule element

_WHITE [min=0 max=infinity] (23)

that is, any number of white spaces, regardless of the particular type and number of whitespace
characters in the set of samples.

Other attributes can control the actions that get built for the generated rules and their
components. For example, a QUICKSEM attribute with values true/false generates actions for
semantic information to be copied automatically when a rule matches text. In the phone number
example, the QUICKSEM attribute would cause the automatic creation of a data item called "prefix"
with value "497" and a second data item called "suffix" with value "5318" in the _phone node, given
that the _phone rule matched a text string such as "497-5318." The LABEL (or LAYER) attribute
takes a name as its value and leads to the generation of a label action in the associated rules that get
generated.
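The effect of the QUICKSEM attribute in the phone number example may be sketched as follows. The representation of a rule as (element, label) pairs, the dictionary form of the resulting node, and the function names are assumptions of this sketch rather than the generated action code itself.

```python
def quicksem(elements, matched):
    """Copy the text matched by each labeled element into a named data item.

    elements: list of (rule_element, label_or_None) pairs.
    matched: the text matched by each element, in the same order.
    """
    return {label: text
            for (_, label), text in zip(elements, matched)
            if label is not None}


def reduce_with_quicksem(concept, elements, matched):
    """Build the node that replaces the matched nodes in the parse tree."""
    return {"concept": concept, "data": quicksem(elements, matched)}


PHONE_RULE = [("_NUMBER", "prefix"), ("_PUNCTUATION", None), ("_NUMBER", "suffix")]

if __name__ == "__main__":
    print(reduce_with_quicksem("_phone", PHONE_RULE, ["497", "-", "5318"]))
```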

USER INTERFACE 

FIG. 12 illustrates a user interface 600 that allows a user to operate the generator program
106. The left panel 602 displays the sample hierarchy 604 with phoneNumber concept 606 selected.
The right panel 608 displays the rule file automatically generated for the phoneNumber concept.
The partial hierarchy file listing below details the commands for building the concepts of the sample
hierarchy for a generator program 106 configured to process resumes. Each line builds one concept
(the listing does not distinguish among organizing concepts, rule concepts, and label concepts).
Each concept can have an arbitrary number of samples assigned to it by a user. While most of the
samples for the generator program 106 are smaller than a sentence (intrasentential), the method and
system of the present invention apply to paragraphs and sections of texts as well as to intrasentential
patterns.

"concept" "gram" "LiteralPhrase" "HeaderPhrase"
"concept" "gram" "LiteralPhrase" "HeaderPhrase" "ContactHeaderPhrase"
"concept" "gram" "LiteralPhrase" "HeaderPhrase" "ObjectiveHeaderPhrase"
"concept" "gram" "LiteralPhrase" "HeaderPhrase" "EducationHeaderPhrase"
"concept" "gram" "LiteralPhrase" "HeaderPhrase" "ExperienceHeaderPhrase"
"concept" "gram" "LiteralPhrase" "HeaderPhrase" "SkillsHeaderPhrase"
"concept" "gram" "LiteralPhrase" "HeaderPhrase" "PresentationsHeaderPhrase"
"concept" "gram" "LiteralPhrase" "HeaderPhrase" "PublicationsHeaderPhrase"
"concept" "gram" "LiteralPhrase" "HeaderPhrase" "ReferencesHeaderPhrase"
"concept" "gram" "LiteralPhrase" "HeaderPhrase" "OtherHeaderPhrase"
"concept" "gram" "LiteralPhrase" "Others"
"concept" "gram" "LiteralPhrase" "Others" "degreeInMajor"
"concept" "gram" "LiteralPhrase" "Others" "WebLinks"
"concept" "gram" "LiteralPhrase" "Others" "emailHeader"
"concept" "gram" "LiteralPhrase" "Others" "minorKey"
"concept" "gram" "LiteralPhrase" "Caps"
"concept" "gram" "LiteralPhrase" "Caps" "cityPhrase"
"concept" "gram" "LiteralPhrase" "Caps" "statePhrase"
"concept" "gram" "LiteralPhrase" "Caps" "companyPhrase"
"concept" "gram" "LiteralPhrase" "Caps" "degreePhrase"
"concept" "gram" "LiteralPhrase" "Caps" "countryPhrase"
"concept" "gram" "LiteralPhrase" "Caps" "skillsPhrase"
"concept" "gram" "LiteralPhrase" "Caps" "naturalLanguages"
"concept" "gram" "LiteralPhrase" "Caps" "software"
"concept" "gram" "LiteralPhrase" "Caps" "hardware"
"concept" "gram" "LiteralPhrase" "Caps" "certifications"
"concept" "gram" "LiteralPhrase" "Caps" "field"
"concept" "gram" "LiteralPhrase" "Caps" "Thesis"
"concept" "gram" "LiteralPhrase" "Caps" "jobTitle"
"concept" "gram" "LiteralPhrase" "Caps" "jobPhrase"
"concept" "gram" "Word"
"concept" "gram" "Word" "Syntax"
"concept" "gram" "Word" "Syntax" "posPREP"
"concept" "gram" "Word" "Syntax" "posDET"
"concept" "gram" "Word" "Syntax" "posPRO"
"concept" "gram" "Word" "Syntax" "posCONJ"
"concept" "gram" "Word" "HeaderWord"
"concept" "gram" "Word" "HeaderWord" "ContactHeaderWord"
"concept" "gram" "Word" "HeaderWord" "ObjectiveHeaderWord"
"concept" "gram" "Word" "HeaderWord" "EducationHeaderWord"
"concept" "gram" "Word" "HeaderWord" "ExperienceHeaderWord"
"concept" "gram" "Word" "HeaderWord" "SkillsHeaderWord"
"concept" "gram" "Word" "HeaderWord" "PresentationsHeaderWord"
"concept" "gram" "Word" "HeaderWord" "PublicationsHeaderWord"
"concept" "gram" "Word" "HeaderWord" "ReferencesHeaderWord"
"concept" "gram" "Word" "HeaderWord" "OtherHeaderWord"
"concept" "gram" "Word" "headerMod"
"concept" "gram" "Word" "openPunct"
"concept" "gram" "Word" "closePunct"
"concept" "gram" "Word" "resumeWord"
"concept" "gram" "Word" "Present"
"concept" "gram" "Word" "Direction"
"concept" "gram" "Word" "adjDirection"
"concept" "gram" "Word" "PostalUnit"
"concept" "gram" "Word" "PostalRoad"
"concept" "gram" "Word" "monthWord"
"concept" "gram" "Word" "monthNum"
"concept" "gram" "Word" "Season"
"concept" "gram" "Word" "PostalState"
"concept" "gram" "Word" "JobTitleRoot"
"concept" "gram" "Word" "jobMod"
"concept" "gram" "Word" "companyRoot"
"concept" "gram" "Word" "companyModroot"
"concept" "gram" "Word" "companyMod"
"concept" "gram" "Word" "ProgrammingLanguage"
"concept" "gram" "Word" "cityMod"
"concept" "gram" "Word" "cityWord"
"concept" "gram" "Word" "Names"
"concept" "gram" "Word" "Names" "femaleName"
"concept" "gram" "Word" "Names" "maleName"
"concept" "gram" "Word" "Names" "surName"
"concept" "gram" "Word" "fieldName"
"concept" "gram" "Word" "subOrg"
"concept" "gram" "Word" "softwareWord"
"concept" "gram" "Phrase"
"concept" "gram" "Phrase" "Contact"
"concept" "gram" "Phrase" "Contact" "humanName"
5 "concept" "gram" "Phrase" "Contact" "humanName" "prefixName" 

"concept" "gram" "Phrase" "Contact" "humanName" "firstName" 

"concept" "gram" "Phrase" "Contact" "humanName" "middleName" 

"concept" "gram" "Phrase" "Contact" "humanName" "lastName" 

"concept" "gram" "Phrase" "Contact" "humanName" "suffixName" 
10 "concept" "gram" "Phrase" "Contact" "cityStateZip" 
il "concept" "gram" "Phrase" "Contact" "cityStateZip" "cityName" 
O "concept" "gram" "Phrase" "Contact" "cityStateZip" "stateName" 
i5 "concept" "gram" "Phrase" "Contact" "cityStateZip" "zipCode" 
' "concept" "gram" "Phrase" "Contact" "cityStateZip" "zipSuffix" 
15 S "concept" "gram" "Phrase" "Contact" "cityStateZip" "country" 

"concept" "gram" "Phrase" "Contact" "cityState" 
i3 "concept" "gram" "Phrase" "Contact" "cityState" "cityName" 

"concept" "gram" "Phrase" "Contact" "cityState" "stateName" 

"concept" "gram" "Phrase" "Contact" "phoneExtension" 
20 "concept" "gram" "Phrase" "Contact" "phoneExtension" "extendWord" 

"concept" "gram" "Phrase" "Contact" "phoneExtension" "extension" 

"concept" "gram" "Phrase" "Contact" "phoneNumber" 

"concept" "gram" "Phrase" "Contact" "phoneNumber" "countryCode" 

"concept" "gram" "Phrase" "Contact" "phoneNumber" "areaCode" 
25 "concept" "gram" "Phrase" "Contact" "phoneNumber" "prefix" 
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"concept" "gram" "Phrase" "Contact" "phoneNumber" "suffix" 
"concept" "gram" "Phrase" "Contact" "phonePhrases" 
"concept" "gram" "Phrase" "Contact" "phonePhrases" "phoneHomeFaxPhrase" 
"concept" "gram" "Phrase" "Contact" "phonePhrases" "phoneHomeFaxPhrase" "HomeFax" 
5 "concept" "gram" "Phrase" "Contact" "phonePhrases" "phoneWorkPhrase" 

"concept" "gram" "Phrase" "Contact" "phonePhrases" "phoneWorkPhrase" "Work" 
"concept" "gram" "Phrase" "Contact" "phonePhrases" "phoneFaxPhrase" 
"concept" "gram" "Phrase" "Contact" "phonePhrases" "phoneFaxPhrase" "Fax" 
"concept" "gram" "Phrase" "Contact" "phonePhrases" "phonePagerPhrase" 
"concept" "gram" "Phrase" "Contact" "phonePhrases" "phonePagerPhrase" "Pager" 
"concept" "gram" "Phrase" "Contact" "phonePhrases" "phoneCellPhrase" 
"concept" "gram" "Phrase" "Contact" "phonePhrases" "phoneCellPhrase" "Cell" 
"concept" "gram" "Phrase" "Contact" "phonePhrases" "phoneHomePhrase" 
"concept" "gram" "Phrase" "Contact" "phonePhrases" "phoneHomePhrase" "Home" 
"concept" "gram" "Phrase" "Contact" "unitRoom" 
"concept" "gram" "Phrase" "Contact" "unitRoom" "unit" 
"concept" "gram" "Phrase" "Contact" "unitRoom" "room" 
"concept" "gram" "Phrase" "Contact" "addressLine" 
"concept" "gram" "Phrase" "Contact" "addressLine" "streetNumber" 
"concept" "gram" "Phrase" "Contact" "addressLine" "streetName" 
"concept" "gram" "Phrase" "Contact" "addressLine" "road" 
"concept" "gram" "Phrase" "Contact" "addressLine" "direction" 
"concept" "gram" "Phrase" "Contact" "addressLine" "postdirection" 
"concept" "gram" "Phrase" "Contact" "addressLine" "POBox" 
"concept" "gram" "Phrase" "Contact" "email" 
"concept" "gram" "Phrase" "SingleDate" "daySD" 

"concept" "gram" "Phrase" "SingleDate" "yearSD" 

"concept" "gram" "Phrase" "SingleDate" "seasonSD" 

"concept" "gram" "Phrase" "DateRange" 
"concept" "gram" "Phrase" "DateRange" "fromDate" 

"concept" "gram" "Phrase" "DateRange" "dateSep" 

"concept" "gram" "Phrase" "DateRange" "toDate" 

"concept" "gram" "Part" 

"concept" "gram" "Part" "addressPart" 
"concept" "gram" "Part" "educationPart" 
"concept" "gram" "Part" "experiencePart" 
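
Each line of the listing above names one concept by its full path from the root of the sample hierarchy. As a minimal sketch of how such a listing could be loaded into a tree, the Python below parses the quoted path lines into nested dictionaries. The function names parse_path and build_hierarchy and the dictionary representation are assumptions made for illustration; they are not the storage format of the described system.

```python
import re

def parse_path(line):
    """Return the quoted names on one listing line, e.g. ['concept', 'gram', 'Word']."""
    return re.findall(r'"([^"]*)"', line)

def build_hierarchy(lines):
    """Build a nested dict from path lines; each key is a concept name."""
    root = {}
    for line in lines:
        path = parse_path(line)
        if not path:
            continue  # skip blank lines
        node = root
        for name in path:
            # Create the child on first sight, then descend into it.
            node = node.setdefault(name, {})
    return root

listing = [
    '"concept" "gram" "Phrase" "Contact" "cityState"',
    '"concept" "gram" "Phrase" "Contact" "cityState" "cityName"',
    '"concept" "gram" "Phrase" "Contact" "cityState" "stateName"',
    '"concept" "gram" "Part" "addressPart"',
]
hierarchy = build_hierarchy(listing)
city_state = hierarchy["concept"]["gram"]["Phrase"]["Contact"]["cityState"]
print(sorted(city_state))  # ['cityName', 'stateName']
```

Because every line repeats the full path, the order of the lines does not matter to this loader; a child listed before its parent simply creates the parent on the way down.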

Machinery for Adding and Managing Samples 

While a command line interface may be utilized by an embodiment of the present invention, the 
preferred embodiment utilizes a graphical user interface (GUI) to manage the sample hierarchy. A 
specialized pull-down menu enables rapid highlighting and labeling of samples and their components 
within a text. When the user selects a concept in the sample hierarchy and then highlights a text, the 
highlighted text sample is placed under that concept, as in FIG. 13. Once the overall sample has been 
added, the user can proceed to add labels (i.e., components of the overall sample), as 
illustrated in FIG. 14, where the user highlights and labels "Long Beach" as a 
cityName. 

In another aspect of the user interface of the present invention, a form tool 580 (FIG. 15) 
accelerates and organizes the addition of a sample and labeling of its components by enabling a user 
to quickly group the textual components of a sample so that they will be properly labeled. Form tool 
580 minimizes the need to use the mouse for highlighting a sample text and each of its components. 
Merely by clicking arrows in the form tool 580, the user can rearrange the components of the overall 
sample so that they are grouped and labeled properly. The form tool 580 can also serve as a locus 
for specifying information about a sample and for guiding the generation of rules and actions from 
the sample. 

Additional tools associated with the sample hierarchy are the Attribute Window and the 
Properties Window. The Properties Window 582 (FIG. 16) provides a structured way to control the 
mode of generating rules for a subhierarchy of the sample hierarchy. The Attribute Window 584 
(FIG. 17) is a lower level interface to attributes, enabling the user to add attributes that have not yet 
been incorporated into the Properties Window 582. 

Sample manager 586 is responsible for bookkeeping to track the file that each sample 
originated from and the offsets of the sample and its labels within that file. The user may further 
associate a sample file with any concept in the sample hierarchy. If the user creates such an 
association, then the system creates copies of samples, their labels, and their offsets in the sample 
file. Sample files enable faster and more efficient generation of the text analyzer by minimizing the 
volume of text that must be analyzed to generate the rules for the analyzer. The sample manager 586 
enables the user to perform functions such as associating a sample file, dissociating a sample file, 
opening the associated sample file, deleting the samples under a concept, and similar manipulations. 
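
The bookkeeping described above can be sketched as follows. This is a minimal illustration only: the Label and Sample records, the to_sample_file helper, the file names, and the example text are all hypothetical and do not come from the described system. Each sample records its source file and character offsets, and copying samples into a sample file rebases the offsets of each sample and its labels onto the new, smaller file.

```python
from dataclasses import dataclass, field

@dataclass
class Label:
    concept: str   # e.g. "cityName"
    start: int     # character offset in the file, inclusive
    end: int       # character offset in the file, exclusive

@dataclass
class Sample:
    concept: str           # hierarchy concept the sample is filed under
    source_file: str       # file the offsets refer to
    start: int
    end: int
    labels: list = field(default_factory=list)

def to_sample_file(samples, texts, sample_file_name):
    """Copy samples into one sample file, newline-separated, with rebased offsets.

    texts maps each source file name to its full text.
    Returns (sample_file_text, copied_samples).
    """
    parts, copies, pos = [], [], 0
    for s in samples:
        snippet = texts[s.source_file][s.start:s.end]
        shift = pos - s.start   # how far every offset of this sample moves
        copies.append(Sample(
            s.concept, sample_file_name, pos, pos + len(snippet),
            [Label(l.concept, l.start + shift, l.end + shift) for l in s.labels],
        ))
        parts.append(snippet)
        pos += len(snippet) + 1   # +1 for the newline separator
    return "\n".join(parts), copies

text = "Resume of J. Doe\nLong Beach, CA 90802\nObjective: ..."
start = text.index("Long Beach, CA")
sample = Sample("cityState", "resume1.txt", start, start + len("Long Beach, CA"),
                [Label("cityName", start, start + len("Long Beach"))])
file_text, copies = to_sample_file([sample], {"resume1.txt": text}, "cityState.txt")
label = copies[0].labels[0]
print(file_text[label.start:label.end])  # Long Beach
```

The sample file here holds only the 14 characters of the sample rather than the whole source text, which is the reduction in analyzed volume that the paragraph above attributes to sample files.
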
FIG. 18 illustrates a menu incorporated into the sample manager 586 for managing samples and 
integrating them with the text analyzer development environment. The major capabilities available 
to the user in the sample manager 586 include: 

FUNCTION                DESCRIPTION 

Add Concept             Add a concept to the sample hierarchy, under selected concept. 
Add Top Concept         Add a top-level concept to the sample hierarchy. 
Add Stub                Add a top-level concept and link it to a region of the text analyzer 
                        sequence. 
Delete                  Delete the selected item in the sample hierarchy. 
Delete Children         Delete the children of the selected concept. 
Find                    Find the selected concept or sample by name. 
Associate Sample File   Associate a sample file with the selected concept. 
Convert to Sample File  Write the samples under selected concept to a sample file. 
Delete Samples          Delete the samples under selected concept. 
Disassociate File       Disassociate the sample file from the selected concept. 
Generate Sample File    Generate sample file for selected concept. 
Open Sample File        Open a sample file for study or editing. 
Attributes              Bring up the Attribute Window. 
Highlight Matches       Show where a concept's rules have matched an analyzed text. 
Mark for Generation     Mark concept for quick generation of rules (i.e., generate dirty). 
Properties              Bring up the Properties Window. 
Rules                   Edit or view the Rule File generated for the selected concept. 
View Tree               View the parse tree due to selected concept's rules (for analyzed text). 
Archive Grammar         Store the sample hierarchy in the local archive. 
Local Archive           View the local archive. 
Server Archive          View the remote archive. 
Upload Grammar          Store the sample hierarchy in the remote archive. 

The left panel 590 in FIG. 19 illustrates the automatically generated passes in the analyzer 
sequence, corresponding to the Phrase Stub 588 of FIG. 18. The right panel 592 shows the selected 
file, phoneNumber, within the stub region. 

We have described a system, method, and computer readable medium for generating text 
analyzers from samples. The users of a text analyzer need not understand how rules are generated 
in order to maintain and enhance the capabilities of the text analyzer. Nonprogrammer and 
nonlinguist users can add samples that the text analyzer does not identify, in order to expand the 
processing power of the text analyzer. While the present invention has been illustrated and described 
in detail, it is to be understood that numerous modifications may be made to the preferred 
embodiment without departing from the spirit of the invention. 

CLAIMS 

What is claimed is: 

1. A method for generating a text analysis program for recognizing patterns appearing in text and 
extracting information from said patterns, the method comprising the steps of: 

(a) providing a sample hierarchy, said sample hierarchy comprising samples of text; 

(b) extracting at least one rule from said sample hierarchy, said rule describing how to process a 
portion of text; 

(c) generating a pass from said rule, said pass containing instructions to operate a text analyzer; and 

(d) constructing a text analyzer containing said pass. 

2. The method of Claim 1, wherein said rule is generalized into multiple rules and multiple passes. 

3. The method of Claim 1, wherein multiple passes are added to said text analyzer. 

4. The method of Claim 3, wherein said multiple passes are arranged in a cascading manner having 
a sequence of passes such that rules associated with a pass are applied to subsequent passes. 

5. The method of Claim 1, wherein the samples are associated with offset values, said offset values 
identifying locations in a parse tree data structure, said parse tree containing concepts stored at 
locations identified by said offsets. 

6. The method of Claim 4, further comprising the step of allowing a user to control the extraction 
of rules from the sample hierarchy. 



7. The method of Claim 5, further comprising the step of allowing a user to designate properties 
associated with said rules, said properties controlling rule generation for a portion of the sample 
hierarchy. 

8. The method of Claim 5, wherein said concepts are retrieved from said parse tree and processed 
to form said rule. 

9. The method of Claim 6, further comprising the step of allowing a user to designate attributes 
associated with said rules, said attributes guiding the application of said rules. 

10. The method of Claim 1, wherein multiple rules are generalized and merged into a single rule if 
there is a difference between the multiple rules. 

11. The method of Claim 10, wherein said samples may be contained in a sample file. 

12. A sample hierarchy data structure for use in a text analyzer system, said sample hierarchy 
comprising an index for storing samples, said samples comprising portions of text, said samples used 
to generate rules for identifying patterns appearing in text, said samples used to derive information 
from said identified patterns, said rules generated by parsing said text samples, said index organized 
such that passes comprising operational steps and rules are generated in an order wherein simple 
patterns are recognized by said text analyzer, and said recognized simple patterns are used by said 
text analyzer system and used to iteratively recognize more complex patterns. 

13. A computer readable medium containing instructions which, when executed by a computer, 
generate a text analysis program for recognizing patterns appearing in text and extracting 
information from said patterns, by: 

(a) providing a sample hierarchy, said sample hierarchy comprising samples of text; 

(b) extracting at least one rule from said sample hierarchy, said rule describing how to process a 
portion of text; 

(c) generating a pass from said rule, said pass containing instructions to operate a text analyzer; and 

(d) constructing a text analyzer containing said pass. 

ABSTRACT 

A system, method, and computer program for automatically generating text analysis systems is 
disclosed. Individual passes of a multi-pass text analyzer are created by generating rules from 
samples supplied by users. Successive passes are created in a cascading fashion by performing 
partial text analyses employing existing passes. A complete text analyzer interleaves the generated 
passes with a framework of existing passes. The complete text analysis system can then process 
texts to identify patterns similar to samples added by users. Generation of rules from samples 
encompasses a wide range of constructs and granularities that occur in text, from individual words 
to intrasentential patterns, to sentential, paragraph, section, and other formats that occur in text 
documents. 
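
The flow summarized above (samples yield rules, rules yield passes, and passes are cascaded into an analyzer) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the disclosed implementation: the Rule record, the functions extract_rule, generate_pass, and build_analyzer, the exact-sequence matching, and the example phone data are all hypothetical. Input nodes are assumed to have already been named by earlier passes, and the rule that builds a phoneWorkPhrase from a Work node followed by a phoneNumber node is an assumption about how those hierarchy concepts relate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    concept: str       # name given to the matched span
    elements: tuple    # sequence of node names the rule must match

def extract_rule(concept, labeled_sample):
    """Derive a rule from one sample given as (label, text) pairs."""
    return Rule(concept, tuple(label for label, _ in labeled_sample))

def generate_pass(rules):
    """Return a pass: a function that rewrites a node list using its rules."""
    def run(nodes):
        out, i = [], 0
        while i < len(nodes):
            for rule in rules:
                n = len(rule.elements)
                window = nodes[i:i + n]
                if n and tuple(name for name, _ in window) == rule.elements:
                    # Reduce the matched nodes to one node named after the concept.
                    out.append((rule.concept, " ".join(t for _, t in window)))
                    i += n
                    break
            else:
                out.append(nodes[i])  # no rule matched here; keep the node
                i += 1
        return out
    return run

def build_analyzer(passes):
    """Cascade the passes: each pass sees the output of the previous one."""
    def analyze(nodes):
        for p in passes:
            nodes = p(nodes)
        return nodes
    return analyze

phone_sample = [("areaCode", "555"), ("prefix", "010"), ("suffix", "0199")]
work_sample = [("Work", "Work:"), ("phoneNumber", "555 010 0199")]

analyzer = build_analyzer([
    generate_pass([extract_rule("phoneNumber", phone_sample)]),
    generate_pass([extract_rule("phoneWorkPhrase", work_sample)]),
])
nodes = [("Work", "Office:"), ("areaCode", "555"), ("prefix", "020"), ("suffix", "0123")]
result = analyzer(nodes)
print(result)  # [('phoneWorkPhrase', 'Office: 555 020 0123')]
```

The second rule can only match because the first pass has already reduced three nodes to a single phoneNumber node, which is the sense in which simpler patterns feed the recognition of more complex ones.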
DECLARATION FOR PATENT APPLICATION 



As a below-named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next to my name. 

I believe I am the original, first and sole inventor (if only one name is listed below) or an original, first and joint 
inventor (if plural names are listed below) of the subject matter which is claimed and for which a patent is sought 
on the invention entitled AUTOMATED GENERATION OF TEXT ANALYSIS SYSTEMS, the specification 
of which: 

is attached hereto. 

was filed on as 

Application Serial No. 

and was amended on . 

(if applicable) 

I hereby state that I have reviewed and understand the contents of the above-identified specification, including the 
claims, as amended by any amendment referred to above. 

I acknowledge the duty to disclose to the Patent Office all information known to me to be material to patentability 
as defined in 37 C.F.R. 1.56. 

I hereby claim foreign priority benefits under Title 35, United States Code, § 119 of any foreign application(s) for 
patent or inventor's certificate listed below and have also identified below any foreign application for patent or 
inventor's certificate having a filing date before that of the application on which priority is claimed: 

Prior Foreign Application(s) 



(Number) (Country) (Day/Month/Year Filed) 

I hereby claim the benefit under Title 35, United States Code, § 119(e) of any United States provisional 
application(s) listed below: 



(Application Serial No.) (Filing Date) (Status) 

(patented, pending, abandoned) 

I hereby claim the benefit under Title 35, United States Code, § 120 of any United States application(s) listed below 
and, insofar as the subject matter of each of the claims of this application is not disclosed in the prior United States 
application in the manner provided by the first paragraph of Title 35, United States Code, § 112, I acknowledge the 
duty to disclose to the Patent Office all information known to me to be material to patentability as defined in 37 
C.F.R. 1.56 which occurred between the filing date of the prior application and the national or PCT international 
filing date of this application: 



(Application Serial No.) (Filing Date) (Status) 

(patented, pending, abandoned) 



(check one) □ 

Priority Claimed: □ Yes □ No 

Direct all telephone calls to Aldo J. Test at (650) 494-8700. 



Address all correspondence to: 

FLEHR HOHBACH TEST 
ALBRITTON & HERBERT LLP 
Suite 3400, Four Embarcadero Center 
San Francisco, California 94111 

File No. A-68807/AJT/JWC 

I hereby declare that all statements made herein of my own knowledge are true and that all statements made on 
information and belief are believed to be true; and further that these statements were made with the knowledge that 
willful false statements and the like so made are punishable by fine or imprisonment, or both, under Title 18, United 
States Code, § 1001 and that such willful false statements may jeopardize the validity of the application or any patent 
issued thereon. 

Full name of first or sole 
inventor: 

Inventor's signature: 
Date: 

Residence: Laguna Beach, California 

Citizenship: USA 

Post Office Address: 1261 Starlit Drive 

Laguna Beach, California 92651 

POWER OF ATTORNEY BY ASSIGNEE 
AND EXCLUSION OF INVENTOR(S) UNDER 37 C.F.R. §1.32 

To the Commissioner of Patents and Trademarks: 

The undersigned assignee of the entire interest in application for letters patent submitted 
herewith entitled AUTOMATED GENERATION OF TEXT ANALYSIS SYSTEMS, filed 
herewith, and having the named inventor(s): 

Avi Meyers 

hereby appoints the following attorneys to prosecute this application and to transact all business 
in the Patent and Trademark Office connected therewith; said appointment to be to the exclusion 
of the inventor(s) and his (their) attorney(s) in accordance with the provisions of 37 C.F.R. 1.32: 
nnoldC.HoUb«cli,Rc8.No. 17,757; Aldo J, Teat, Reg, No. l8.0«;I>oi!aMN.M«cIiitcwh, Reg- 
No. 20,316* Bdwaid S* Wright^ R«g. No. 24,903; David J. BfereKifflr, Reg, No. 24,774; Richacd E. 
Backw, Reg. No. 22,701; Jsunss A. Shecidsn, Reg. No. 25,435; ^bslbcn B. Chlokeriag, Reg. 
No. 24;286; Richard F, TrBCttiia»Ri^.Nb. 31.801 ; StevenF. Casorza, Keg.No.^»78(HMl<^sae1 A, 
Kwfknan, R^. No. 32,988; Bdward N. Bxohaad^ Keg. No. 37,085; R. Michael Ananiaz^ Reg. 
No, 35,050; Stcphm M- Knaueir, Reg. No. 38,;^; Ri^bixi M, Silva, Reg. No. 38^04; I>avid C. 
Asbhy, Rc®. No. 36,432; Maria S, Swuatek, Reg. No. 37,244, I>oHy A, Vanee, Reg. No. 39,054, 
Jufian O^lam Reg. No. 1 4,785» Brian G. Hart, Reg. Mb. 44.421 ; Steven M. Freelandj, Reg. No. 
4%555; WaUam E. Nirttie; Reg. No. 42^943; and Victor Johnson, No. 41 ,546, provided 
that if any one of said attorneys ceases being affiliated with the law firm of Flehr Hohbach Test 
Albritton & Herbert LLP as partner, employee or of counsel, such attorney's appointment as attorney 
and all powers derived therefrom shall terminate on the date such attorney ceases being affiliated. 

Direct all telephone calls to Aldo J. Test at (650) 494-8700. Address all correspondence to: 

FLEHR HOHBACH TEST 
ALBRITTON & HERBERT LLP 
Suite 3400, Four Embarcadero Center 
San Francisco, California 94111-4187 

Text Analysis International Inc., a Corporation of the State of California, certifies that 
it is the assignee of the entire right, title and interest in the patent application identified above by 
virtue of an assignment from the inventor of the patent application identified above. A copy of 
the executed, unrecorded assignment is attached. The undersigned (whose title is supplied 
below) is empowered to act on behalf of the assignee. 



TEXT ANALYSIS INTERNATIONAL INC. 

Avi Meyers, CEO 
Text Analysis International 
1604 Mariflxd Drive 
Sunnyvale, California 5M087 